Unemployment in the United States 1976-2022

Predicting Unemployment Trends

Author

Whitney Hollman

Published

December 31, 2023

Abstract
The purpose of these findings is to bring a visual and quantiative awareness of high or low employment rates, and what areas of the country are affected the most.

Unemployment in the U.S.

Dataset from kaggle.com

Code
unemployment <- read.csv(here("Project_1", "Unemployment in America Per US State.csv"))

Introduction

Rewrite

This analysis pertains to a dataset that is currently available on Kaggle, and the link will be provided below. The original reason for the author to compile the dataset was to help increase awareness of unemployment trends within various communities, and across the United States.(Jason Oh, n.d.)

I agree with the original author of the dataset that unemployment in this country continues to be a significant issue. Not only an issue to the health of our economy and all of those repercussions, but also this issue can contribute to our shelterless populations.

With a higher level of understanding in tends over time, communities can better prepare for possible dips in unemployment, and the crisis this may have on the members of their communities, as well as the communities themselves.

This projects primary goal is to use trends in unemployment rates, by region and state over time to help predict future trends in unemployment. By predicting unemployment trends it is my hope, and others, that communities can begin to create better outreach programs, and jobs to help improve the lives of persons within their community, who may be destabilized economically through no fault of their own. The dataset spans from 1976-2022.(Jason Oh, n.d.)

One interesting avenue I hope to explore is a relationship between unemployment totals, and their counterpart non-institutional populations. Which population trends higher, are they parallel by state, year. Do they affect each other, etc. Communities with high unemployment and non-institutional populations are more at risk of repeated high unemployment, and it is important for those particular communities to understand what is going on within their prospective populations.

Data sources are sited below, with the links for the dataset website.

Data Source:

https://www.kaggle.com/datasets/justin2028/unemployment-in-america-per-us-state/

Data was taken from the Bureau of Labor Statistics, and complied by Jason Oh the author of the dataset on Kaggle. Data was compiled directly from Bureau of Labor Statistics by the author. The dataset tracks relevant population statistics and employment rates per US state, since 1976.

  1. The Bureau of Labor Statistics’s Economic News Release on (Monthly) State Employment and Unemployment - The Bureau of Labor Statistics has published monthly updates on unemployment rates since January 1976
  2. The Bureau of Labor Statistics’s State Employment and Unemployment Technical Note - The Bureau of Labor Statistics released a detailed overview of their unemployment data, the methodology behind their data, and the proper definitions and terminologies for the variables tracked. The guide mainly provided essential contextual knowledge needed to create a meaningful dataset

Data Preparation

I renamed columns for ease of analysis, did some quick plots, and linear regression. Summary, counting, etc to get a better handle of the data.

Exploratory Analysis

Statistics Being Tracked

Column Names and Variables:

  • FIPS Code of State/Area(Federal Information Processing. Unique codes for states and counties, that are uniquely identified geographically).

  • Year/Month of Statistic

  • Total Civilian Non-Institutional Population in State/Area (All U.S. civilians not residing in institutional group quarters facilities such as correctional institutions, juvenile facilities, skilled nursing facilities, and other long-term care living arrangements. Are unemployed but looking for work)

  • Total Civilian Labor Force in State/Area

  • Percent (%) of State/Area’s Population

  • Total Employment in State/Area

  • Percent (%) of Labor Force Employed in State/Area

  • Total Unemployment in State/Area

  • Percent (%) of Labor Force Unemployed in State/Area

Data Cleaning

Code
unemployment <- read.csv(here("Project_1", "Unemployment in America Per US State.csv"))
Code
if(any(is.na(unemployment))) {
  pring("There are non-finite values in the data set.")
} else {
  print("There are no non-finite values in the data set.")}
[1] "There are no non-finite values in the data set."
Code
unemployment <- unemployment %>%
  clean_names()
Code
sapply(unemployment, class)
                                                fips_code 
                                                "integer" 
                                               state_area 
                                              "character" 
                                                     year 
                                                "integer" 
                                                    month 
                                                "integer" 
total_civilian_non_institutional_population_in_state_area 
                                              "character" 
                 total_civilian_labor_force_in_state_area 
                                              "character" 
                       percent_of_state_area_s_population 
                                                "numeric" 
                           total_employment_in_state_area 
                                              "character" 
            percent_of_labor_force_employed_in_state_area 
                                                "numeric" 
                         total_unemployment_in_state_area 
                                              "character" 
          percent_of_labor_force_unemployed_in_state_area 
                                                "numeric" 
Code
clean_unemployment <- unemployment %>%
  select(-month) 

# Switch year to numeric for ggplots
clean_unemployment$year <- as.numeric(clean_unemployment$year)
Time Series Column Created

I mutated the month and year into one column called state_date, that will essentially allow me to use time series analysis on the data set.

Code
clean_unemployment[1:6,] %>%
  get_one_to_one()
[[1]]
[1] "fips_code"                                                
[2] "state_area"                                               
[3] "total_civilian_non_institutional_population_in_state_area"
[4] "total_civilian_labor_force_in_state_area"                 
[5] "percent_of_state_area_s_population"                       
[6] "total_employment_in_state_area"                           
[7] "percent_of_labor_force_employed_in_state_area"            
[8] "total_unemployment_in_state_area"                         
[9] "percent_of_labor_force_unemployed_in_state_area"          
Code
#Function for removing and converting all columns
remove_commas_and_convert_to_numeric <- function(x) {
 as.numeric(gsub(",", "", x))
}

clean_unemployment <- clean_unemployment %>%
  mutate(across(starts_with("total"), remove_commas_and_convert_to_numeric))
Code
clean_unemployment$region <- case_when(
   clean_unemployment$state_area %in% c("California", "Los Angeles County", "Oregon", "Washington", "Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Montana", "Wyoming", "Alaska", "Hawaii", "Utah") ~ "West",
  clean_unemployment$state_area %in% c("North Dakota", "South Dakota", "Nebraska", "Kansas", "Minnesota", "Iowa", "Missouri", "Wisconsin", "Michigan", "Illinois", "Indiana", "Ohio") ~ "Midwest",
  clean_unemployment$state_area %in% c("Texas", "Oklahoma", "Arkansas", "Louisiana", "Mississippi", "Alabama", "Georgia", "Florida", "South Carolina", "North Carolina", "Virginia", "Tennessee", "Kentucky", "Delaware", "Maryland", "Washington D.C.", "West Virginia", "District of Columbia") ~ "South",
  clean_unemployment$state_area %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "Pennsylvania", "New York", "New York City") ~ "Northeast",
  TRUE ~ "Other"
)
Code
clean_unemployment <- clean_unemployment %>%
  mutate(percentage_total_civilian_non_institutional_pop = round((total_civilian_non_institutional_population_in_state_area / sum(total_civilian_non_institutional_population_in_state_area)) * 100, 4)) 
Code
clean_unemployment <- clean_unemployment %>%
  relocate(region, .after = state_area)
Code
clean_unemployment <- clean_unemployment %>%
  arrange(state_area, year) %>%
  mutate(
    lagged_value = lag(total_unemployment_in_state_area),
    percentage_change = ifelse(
        lagged_value != 0,
(total_unemployment_in_state_area - lagged_value)/lagged_value * 100,
NA)
  ) %>%
  drop_na(percentage_change)

clean_unemployment$percentage_change <- 
  round(clean_unemployment$percentage_change, 1)

Descriptive Statistics and Visualizations

Code
stat_summary <- clean_unemployment %>%
  select(
    total_civil_non_instit_pop = 
      total_civilian_non_institutional_population_in_state_area,
    total_unemployment = 
      total_unemployment_in_state_area,
    percent_unemployment =
      percent_of_labor_force_unemployed_in_state_area,
    percent_state_pop =
      percent_of_state_area_s_population,
    percent_non_instit_pop =
      percentage_total_civilian_non_institutional_pop,
    percentage_change,
    region,
    state_area,
    year,
    total_labor_force = 
      total_civilian_labor_force_in_state_area,
    total_employed =
      total_employment_in_state_area,
    total_unemployed =
      total_unemployment_in_state_area
     )

stat_summary %>%
  summary()
 total_civil_non_instit_pop total_unemployment percent_unemployment
 Min.   :  232000           Min.   :   4980    Min.   : 1.900      
 1st Qu.: 1103972           1st Qu.:  37370    1st Qu.: 4.300      
 Median : 2935000           Median : 103945    Median : 5.500      
 Mean   : 4235583           Mean   : 169550    Mean   : 5.921      
 3rd Qu.: 5390572           3rd Qu.: 210246    3rd Qu.: 7.100      
 Max.   :31236439           Max.   :3018611    Max.   :30.600      
 percent_state_pop percent_non_instit_pop percentage_change     region         
 Min.   :51.00     Min.   :0.000200       Min.   : -95.200   Length:29891      
 1st Qu.:62.80     1st Qu.:0.000900       1st Qu.:  -1.300   Class :character  
 Median :65.90     Median :0.002300       Median :  -0.300   Mode  :character  
 Mean   :65.52     Mean   :0.003346       Mean   :   0.813                     
 3rd Qu.:68.50     3rd Qu.:0.004300       3rd Qu.:   0.900                     
 Max.   :75.70     Max.   :0.024700       Max.   :4262.400                     
  state_area             year      total_labor_force  total_employed    
 Length:29891       Min.   :1976   Min.   :  160022   Min.   :  148718  
 Class :character   1st Qu.:1987   1st Qu.:  731810   1st Qu.:  679548  
 Mode  :character   Median :1999   Median : 1878203   Median : 1750537  
                    Mean   :1999   Mean   : 2734868   Mean   : 2565318  
                    3rd Qu.:2011   3rd Qu.: 3417318   3rd Qu.: 3230672  
                    Max.   :2022   Max.   :19600700   Max.   :18754316  
 total_unemployed 
 Min.   :   4980  
 1st Qu.:  37370  
 Median : 103945  
 Mean   : 169550  
 3rd Qu.: 210246  
 Max.   :3018611  

Summary Statistics for Key Variables

Using summary statistics on four specific variables that I believe to be relevant to the analysis. This gives a brief overview on the similarities and differences between what I believe to be the most important variables.

Code
key_summary <- clean_unemployment %>%
  select(
    total_civil_non_instit_pop = 
      total_civilian_non_institutional_population_in_state_area,
    total_unemployment = 
      total_unemployment_in_state_area,
    percent_unemployment =
      percent_of_labor_force_unemployed_in_state_area,
    percent_state_pop =
      percent_of_state_area_s_population
     )

key_summary %>%
  select(total_civil_non_instit_pop, 
         total_unemployment, 
         percent_unemployment, 
         percent_state_pop) %>%
  summary()
 total_civil_non_instit_pop total_unemployment percent_unemployment
 Min.   :  232000           Min.   :   4980    Min.   : 1.900      
 1st Qu.: 1103972           1st Qu.:  37370    1st Qu.: 4.300      
 Median : 2935000           Median : 103945    Median : 5.500      
 Mean   : 4235583           Mean   : 169550    Mean   : 5.921      
 3rd Qu.: 5390572           3rd Qu.: 210246    3rd Qu.: 7.100      
 Max.   :31236439           Max.   :3018611    Max.   :30.600      
 percent_state_pop
 Min.   :51.00    
 1st Qu.:62.80    
 Median :65.90    
 Mean   :65.52    
 3rd Qu.:68.50    
 Max.   :75.70    

Subsets for High and Low Unemployment Rates per Region

Code
# Average unemployment rate for each state

average_unemployment <- clean_unemployment %>%
  group_by(region) %>%
  summarize(avg_unemployment = mean(total_unemployment_in_state_area),
            avg_labor_force = mean(total_civilian_labor_force_in_state_area),
     yearly_avg_unemployment = mean(percent_of_labor_force_unemployed_in_state_area),
     avg_population = mean(total_civilian_labor_force_in_state_area))

# Top 5 lowest and highest average

top_5_highest <- average_unemployment %>%
  top_n(5, wt = avg_unemployment)

top_5_lowest <- average_unemployment %>%
  arrange(avg_unemployment) %>%
  slice(1:5)

# Subset data for top 5 highest and lowest
unemployment_high_unemployment <- clean_unemployment %>%
  filter(region %in% top_5_highest$region) %>%
  select(region, year, total_unemployment_in_state_area, total_civilian_labor_force_in_state_area) %>%
  arrange(desc(year))
  

unemployment_low_unemployment <- clean_unemployment %>%
  filter(region %in% top_5_lowest$region) 

Visualizations

Visualizations with ggplot2

Visualization 1: Total Unemployment vs. Total Civilian Labor Force

Code
clean_unemployment %>%
  ggplot(mapping = aes(
    x = total_civilian_labor_force_in_state_area,
    y = total_unemployment_in_state_area, color = region)) +
  geom_point() +
  geom_jitter() +
  geom_line() +
  labs(
    x = "Total Civilian Labor Force in State Area per 100k",
    y = "Total Unemployment in State Area per 100k",
    title = "Total Unemployment vs. Total Civilian Labor Force",
    subtitle = "Line Graph",
    caption = "Data Source: Bureau of Labor Statistics",
    fill = "Region") +
  guides(color = guide_legend(title = NULL)) +
  theme_minimal() +
  theme(
    axis.title = element_text(face="bold", size = "12"),
    plot.title = element_text(color = "purple", size = 14, face = "bold"),
    plot.subtitle = element_text(color = "orange", size = 8, face = "bold"),
  legend.position = "top") 

Visualization 2: Total Unemployment vs. Total Civilian Labor Force with Facets

Code
clean_unemployment %>%
  ggplot(mapping = aes(
    x = total_unemployment_in_state_area,
    y = region, color = region)) +
  geom_point() +
  geom_jitter(alpha = 0.15) +
  labs(
    x = "Total Unemployment By Region",
    y = "Region",
    title = "Total Unemployment vs. Region",
    subtitle = "Facet Plot",
    caption = "Data Source: Bureau of Labor Statistics") +
   guides(color = guide_legend(title = NULL)) +
  theme_minimal() +
  theme(axis.text.y = element_text(angle = 90, hjust = 1)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        axis.title = element_text(size = 12, face = "bold"),
        
        plot.title = element_text(color = "purple", size = 14, 
                                  face = "bold"),
        plot.subtitle = element_text(color = "orange", size = 8, face = "bold")) 

Visualization 1 and 2 Demonstrate:

The line graph shows us that as the total civilian labor force increases, so does the total unemployment. This makes sense, as the more people that are in the labor force, the more people that are unemployed. The facet plot shows us that the West region has the highest total unemployment, while the Midwest region has the lowest total unemployment.

Visualization 3: Histogram-Density Plot of Average Unemployment

Code
set.seed(1234)
average_unemployment %>%
  ggplot(mapping = aes(
    x = avg_unemployment)) +
  geom_histogram(aes(y = ..density..), bins = 18, fill = "#33CCCC", color = "black") +
  geom_density(color = "#669900", size = 1.8) +
  geom_vline(aes(xintercept = mean(avg_unemployment)), color = "#FF6666", linetype = "dashed", size = 2) +
  labs(
    x = "Average Unemployment for Each State",
    y = "Density",
    title = "Distribution of Average Unemployment Across States",
    subtitle = "Smoothed Density and Histogram Plot",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_minimal() +
  theme(
    axis.title = element_text(size = 12, face = "bold"),
    plot.title = element_text(color = "purple", size = 16, margin = margin(b = 20), face = "bold"),
    plot.subtitle = element_text(color = "#FF9933", face = "bold", size = 10) )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

Visualization 3 Shows:

This histogram represents the probability distribution of the average unemployment rates for each state. The density plot provides a smooth and continuous view of the estimated data distribution.

Average unemployment seems to be decreasing over time for most of the states in the US. This is a good sign that the economy is improving. However, there are still some states that have a higher average unemployment rate than others. This could be due to a number of factors, such as the state’s economy, the state’s population, and the state’s unemployment rate.

Visualization 4: Explore Relationships Unemployment and Non-Institutional Population per State Area

Code
clean_unemployment %>%
  group_by(region) %>%
  ggplot(aes(
    x = total_civilian_non_institutional_population_in_state_area,
    y = percent_of_labor_force_unemployed_in_state_area,
    color = region)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_jitter() +
  labs(
    x = "Non-Insitutional Population in State Area",
    y = "Percent of Labor Force Unemployed in State Area",
    title = "Percent of Labor Force Unemployed vs. Non-Instituional Population",
    subtitle = "Linear Regression Model",
    caption = "Data Source: Bureau of Labor Statistics") +
    theme_classic() +
   theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    plot.title = element_text(color = "purple", size = 14, face = "bold"),
    plot.subtitle = element_text(color = "#FF9933", size = 8, face = "bold"),
    axis.title = element_text(color = "black", size = 8, face = "bold"))
`geom_smooth()` using formula = 'y ~ x'

Visualization 4 Shows:

This plot shows that the West and South have more significant changes in the percent of unemployed population data. With the West having a more positive slope and the South having a more negative slope. The Midwest and Northeast have a more neutral slope. Which means that in the Western States there is a positive relationship between unemployment and non institutional population. In the South there is a negative relationship between unemployment and non institutional population. In the Midwest and Northeast there is no relationship between unemployment and non institutional population.

Visualization 5: Highest and Lowest Unemployment Over Time (1976-2019) per Region

Code
# Non-Institutional Population 
clean_unemployment %>%
  filter(year >= 1976,
         region %in% c('West', 'South','Midwest','Northeast')) %>% 
  ggplot(aes(
    y = total_civilian_non_institutional_population_in_state_area,
    x = year,
    color = region)) +
  geom_point(alpha = 0.25) +
  geom_jitter() +
  labs(
    x = "Year (1976-2022)",
    y = "Non_Institutional Population",
    shape = "Year",
    title = "Total Non-Institutional Population vs. Year (1976-2022)",
    subtitle = "Linear Regression Model",
    caption = "Data Source: Bureau of Labor Statistics") +
  guides(color = guide_legend(title = NULL)) +
  theme_classic() +
  theme(
    legend.position = "bottom",
    plot.subtitle = element_text(color = "#FF9933", size = 8, face = "bold"),
    plot.title = element_text(color = "purple", size = 16, face = "bold"),
    axis.title = element_text(color = "black", size = 12, face = "bold")) +
  scale_x_continuous(breaks = seq(1976, 2022, 4)) +
  scale_y_continuous(breaks = seq(0, 100000000, 10000000)) +
  scale_color_manual(values = c("#33CCFF", "#ff3399", "#33CC00", "#9900cc"))

Code
# Unemployment Population
 clean_unemployment %>%
  filter(year >= 1976,
         region %in% c('West', 'South','Midwest','Northeast')) %>%
  ggplot(aes(
    x = year,
    y = total_unemployment_in_state_area,
    color = region)) +
  geom_point(alpha = 0.25) +
  geom_jitter() +
  labs(
    x = "Year (1976-2022)",
    y = "Total Unemployment By Region",
    shape = "Year",
    title = "Total Unemployment in Region vs. Year (1976-2022)",
    subtitle = "Linear Regression Model",
    caption = "Data from the Bureau of Labor Statistics") +
  guides(color = guide_legend(title = NULL)) +
  theme_classic() +
  theme(
    legend.position = "bottom",
    plot.subtitle = element_text(color = "#FF9933", size = 8, face = "bold"),
    plot.title = element_text(color = "purple", size = 16, face = "bold"),
    axis.title = element_text(color = "black", size = 12, face = "bold")) +
  scale_x_continuous(breaks = seq(1976, 2022, 4)) +
  scale_y_continuous(breaks = seq(0, 10000000, 1000000)) +
  scale_color_manual(values = c("#33CCFF", "#ff3399", "#33CC00", "#9900cc"))

Visualization 5 Shows:

The West has the highest unemployment rate and the highest non-institutional population. The South has the second highest unemployment rate and the second highest non-institutional population. The Midwest has the third highest unemployment rate and the third highest non-institutional population. The Northeast has the lowest unemployment rate and the lowest non-institutional population. Despite the fact that non-institutional population is higher in the West and South, the Northeast has the lowest unemployment rate. This could be due to the fact that the Northeast has a higher percentage of the population that is employed. As far as unemployment is concerned the Northeast is the best region to live in.

Variable Relationships

Code
unemployment_sub <- clean_unemployment %>%
  select(
    percent_of_state_area_s_population,
    percent_of_labor_force_employed_in_state_area,
    percent_of_labor_force_unemployed_in_state_area,
    percentage_total_civilian_non_institutional_pop,
    percentage_change,
    year
    ) %>%
    sample_n(300)
Code
unemployment_sub %>%
  select(
    percent_state_area = 
      percent_of_state_area_s_population,
    percent_labor_employed =
      percent_of_labor_force_employed_in_state_area,
    percent_labor_unemployed = 
      percent_of_labor_force_unemployed_in_state_area,
    percent_non_instit_pop = 
      percentage_total_civilian_non_institutional_pop
  )
Code
pair_plot <- unemployment_sub %>%
  ggpairs(ggplot2::aes(fill = "#3399ff")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution 1",
    subtitle = "Scatterplot Matrix",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_linedraw() +
  theme(
  plot.title = element_text(color = "#3399cc", size = 12, face = "bold"),
  plot.subtitle = element_text(color = "#cc6600", size = 8, 
                               face = "bold"),
  axis.title = element_text(color = "black", size = 8, face = "bold")) +
  scale_fill_manual(values = c("#3399ff"))
pair_plot

Code
#Distribution of Correlation 2
unemployment_sub_2 <- clean_unemployment %>%
  select(percent_non_instit_pop = 
           percentage_total_civilian_non_institutional_pop,
         percent_state_area = 
           percent_of_state_area_s_population,
        ) %>%
  sample_n(300)

pair_plot_2 <- unemployment_sub_2 %>%
  ggpairs(ggplot2::aes(fill = "#3399ff")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution 2",
    subtitle = "Scatterplot Matrix",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_linedraw() +
  theme(
  plot.title = element_text(color = "#003399", size = 12, face = "bold"),
  plot.subtitle = element_text(color = "#cc6600", size = 8, face = "bold"),
  axis.title = element_text(color = "black", size = 8, face = "bold")) +
  scale_fill_manual(values = c("#3399ff"))
pair_plot_2

Code
#Distribution of Correlation 3
unemployment_sub_3 <- clean_unemployment %>%
  select(percent_state_area = 
           percent_of_state_area_s_population,
         percent_labor_unemployed =
         percent_of_labor_force_unemployed_in_state_area,
        ) %>%
  sample_n(300)

pair_plot_3 <- unemployment_sub_3 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution 3",
    subtitle = "Scatterplot Matrix",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_linedraw() +
  theme(
 plot.title = element_text(color = "#003399", size = 12, face = "bold"),
  plot.subtitle = element_text(color = "#cc6600", size = 8, face = "bold"),
  axis.title = element_text(color = "black", size = 8, face = "bold")) +
  scale_fill_manual(values = c("#3399ff"))
pair_plot_3

Code
#Distribution of Correlation 4
unemployment_sub_4 <- clean_unemployment %>%
  select(percent_labor_employed =
    percent_of_labor_force_employed_in_state_area,
         percent_non_instit_pop =
         percentage_total_civilian_non_institutional_pop,
        ) %>%
  sample_n(300)

pair_plot_4 <- unemployment_sub_4 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution 4",
    subtitle = "Scatterplot Matrix",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_linedraw() +
  theme(
  plot.title = element_text(color = "#003399", size = 12, face = "bold"),
  plot.subtitle = element_text(color = "#cc6600", size = 8, face = "bold"),
  axis.title = element_text(color = "black", size = 8, face = "bold")) +
  scale_fill_manual(values = c("#3399ff"))
pair_plot_4

Code
#Distribution of Correlation 5
unemployment_sub_5 <- clean_unemployment %>%
  select( percent_labor_employed =
    percent_of_labor_force_employed_in_state_area,
         percent_labor_unemployed =
         percent_of_labor_force_unemployed_in_state_area,
        ) %>%
  sample_n(300)

pair_plot_5 <- unemployment_sub_5 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution 5",
    subtitle = "Scatterplot Matrix",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_linedraw() +
  theme(
  plot.title = element_text(color = "#003399", size = 12, face = "bold"),
  plot.subtitle = element_text(color = "#cc6600", size = 8, face = "bold"),
  axis.title = element_text(color = "black", size = 8, face = "bold")) +
  scale_fill_manual(values = c("#3399ff"))
pair_plot_5

Code
#Distribution of Correlation 6
unemployment_sub_6 <- clean_unemployment %>%
  select(percent_labor_unemployed =
    percent_of_labor_force_unemployed_in_state_area,
         percent_non_instit_pop =
         percentage_total_civilian_non_institutional_pop,
        ) %>%
  sample_n(300)

pair_plot_6 <- unemployment_sub_6 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution 6",
    subtitle = "Scatterplot Matrix",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_linedraw() +
  theme(
 plot.title = element_text(color = "#003399", size = 12, face = "bold"),
  plot.subtitle = element_text(color = "#cc6600", size = 8, face = "bold"),
  axis.title = element_text(color = "black", size = 8, face = "bold")) +
  scale_fill_manual(values = c("#3399ff"))
pair_plot_6

The variable relationships show: that the significant correlations are distrubutions

Predictor Response Relationships

Code
unemployment_sub <- clean_unemployment %>%
  select(percent_of_state_area_s_population,
         percent_of_labor_force_employed_in_state_area,
         percent_of_labor_force_unemployed_in_state_area,
         total_unemployment_in_state_area,
         total_civilian_labor_force_in_state_area,
         percentage_total_civilian_non_institutional_pop
         ) %>%
  sample_n(300)

unemployment_sub %>%
  ggplot(mapping = aes(
    x = percent_of_state_area_s_population,
    y = total_unemployment_in_state_area)) +
  geom_point() +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percent of State Area's Population",
    y = "Total Unemployment in State Area",
    title = "Unemployment Population vs. Percent of State Population",
    subtitle = "Linear Regression Model 1",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_classic() +
  theme(
  plot.title = element_text(color = "darkblue", size = 16, face = "bold"),
    axis.title = element_text(color = "darkblue", size = 12, face = "bold"))
`geom_smooth()` using formula = 'y ~ x'

Code
unemployment_sub %>%
  ggplot(mapping = aes(
    x = percent_of_labor_force_employed_in_state_area,
    y = total_unemployment_in_state_area)) +
  geom_point() +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percent of Labor Force Employed in State Area",
    y = "Total Unemployment in State Area",
    title = "Unemployment Population vs. Percent of Labor Force Employed",
    subtitle = "Linear Regression Model 2",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_classic() +
  theme(
    plot.title = element_text(color = "darkblue", size = 16, face = "bold"),
    axis.title = element_text(color = "darkblue", size = 12, face = "bold"))
`geom_smooth()` using formula = 'y ~ x'

Code
unemployment_sub %>%
  ggplot(mapping = aes(
    x = percent_of_labor_force_unemployed_in_state_area,
    y = total_civilian_labor_force_in_state_area)) +
  geom_point() +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percent of Labor Force Unemployed in State Area",
    y = "Total Civilian Labor Force in State Area",
    title = "Civilian Labor Force vs. Percent of Labor Force Unemployed",
    subtitle = "Linear Regression Model 3",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_classic() +
  theme(
    plot.title = element_text(color = "darkblue", size = 16, face = "bold"),
    axis.title = element_text(color = "darkblue", size = 12, face = "bold")
  )
`geom_smooth()` using formula = 'y ~ x'

Code
# Despite the correlation above in the pair plots, there does seem to be a positive relationship between the two variables. Unemployment and the Population in prison are both increasing over time.
unemployment_sub %>%
  ggplot(aes(
    x = percentage_total_civilian_non_institutional_pop,
    y = total_unemployment_in_state_area)) +
  geom_point() +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percentage of Total Non-Institutional Population",
    y = "Total Unemployment in State Area",
    title = "Total Unemployment in State Area vs. Percentage of Non-Institutional Population",
    subtitle = "Linear Regression Model 4",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_classic() +
  theme(
    plot.title = element_text(color = "darkblue", size = 16, face = "bold"),
    axis.title = element_text(color = "darkblue", size = 12, face = "bold")
  )
`geom_smooth()` using formula = 'y ~ x'

Model 4 shows a positive and significant relationship between non-institutional population and unemployment. This is a good sign for our model, as it shows that the data is not random and that there is a relationship between the variables. Policies that are related to the non-institutional population can be related to unemployment, and it is important to understand the size of both populations, in order for policymakers to assess the impact of their initiatives on a much broader scale. Both populations defined show the potential of the size of the labor force, if all were to be employed.

Top States with the Highest Unemployment Rate by Year and Comparison of Non-Institutional Population rates by the same Year

Code
top_5_states <- unemployment %>%
  group_by(year, state_area) %>%
  summarize(
    max_unemployment = max(percent_of_labor_force_unemployed_in_state_area),
    max_labor_force = max(percent_of_labor_force_employed_in_state_area),
    .groups = "keep"
  ) %>%
  arrange(desc(max_unemployment))
Code
print(top_5_states)
# A tibble: 2,491 × 4
# Groups:   year, state_area [2,491]
    year state_area         max_unemployment max_labor_force
   <int> <chr>                         <dbl>           <dbl>
 1  2020 Nevada                         30.6            61.7
 2  2020 Hawaii                         22.6            59.9
 3  2020 Michigan                       22.6            59.2
 4  2020 New York city                  21.4            57.8
 5  2020 Los Angeles County             18.8            61.9
 6  1983 West Virginia                  18.4            43.7
 7  2020 Illinois                       18              61.9
 8  2020 Rhode Island                   18              62.2
 9  1982 West Virginia                  17.9            47.3
10  2020 Massachusetts                  16.9            64.6
# ℹ 2,481 more rows
Code
mu_unemployment_sd <- clean_unemployment %>%
  group_by(year, region) %>%
  summarize(across(c(percent_of_labor_force_unemployed_in_state_area),
            list(mu = ~ mean(.), sigma = ~ sd(.))),
            .groups = "keep") %>%
  arrange(desc(percent_of_labor_force_unemployed_in_state_area_mu))
Code
print(mu_unemployment_sd)
# A tibble: 235 × 4
# Groups:   year, region [235]
    year region percent_of_labor_force_unemployed_in_st…¹ percent_of_labor_for…²
   <dbl> <chr>                                      <dbl>                  <dbl>
 1  2020 Other                                      12.4                  5.45  
 2  1976 Other                                      11.1                  0.0853
 3  1992 Other                                      11.1                  0.508 
 4  1993 Other                                      10.4                  0.480 
 5  1977 Other                                      10.2                  0.316 
 6  2021 Other                                      10.0                  1.48  
 7  1983 South                                      10.0                  2.78  
 8  1983 Other                                       9.82                 0.404 
 9  1982 Other                                       9.72                 0.336 
10  1982 South                                       9.70                 2.33  
# ℹ 225 more rows
# ℹ abbreviated names: ¹​percent_of_labor_force_unemployed_in_state_area_mu,
#   ²​percent_of_labor_force_unemployed_in_state_area_sigma

The above summary shows that the top 5 states with the highest unemployment rate by year are: Nevada, Michigan, California, Rhode Island, and Illinois. The summary also shows that the states with the highest unemployment rate also have the highest non-institutional population. This is a good sign for our model, as it shows that the data is not random and that there is a relationship between the variables. Policies that are related to the non-institutional population can be related to unemployment, and it is important to understand the size of both populations, in order for policymakers to assess the impact of their initiatives on a much broader scale. Both populations defined show the potential of the size of the labor force, if all were to be employed.

Time Series Analysis of Percdentage Change in Unemployment Rate by Year.

Code
ggplot(mu_unemployment_sd, aes(x = year, y = percent_of_labor_force_unemployed_in_state_area_mu)) +
  geom_point(color = "blue") +
  geom_line(color = "darkblue") +
  geom_errorbar(aes(ymin = percent_of_labor_force_unemployed_in_state_area_mu - percent_of_labor_force_unemployed_in_state_area_sigma,
                    ymax = percent_of_labor_force_unemployed_in_state_area_mu + percent_of_labor_force_unemployed_in_state_area_sigma),
                width = 0.2) +
  facet_wrap(~region, ncol = 2) +
  labs(
    x = "Year",
    y = "Percent of Labor Force Unemployed By Region",
    title = "Percent of Labor Force Unemployed in Region By Year",
    caption = "Data Source: Bureau of Labor Statistics") +
  theme_minimal() +
  theme(
    plot.title = element_text(color = "darkblue", size = 16, face = "bold"),
    axis.title = element_text(color = "darkblue", size = 12, face = "bold")
  )

By analyzing the data by the mean and standard deviation, we can see that the unemployment rate is different in each region, but with only slight differences. Which tells us that for the most part the same trend in unemployment in one region, will most likely be seen by the other regions. More or less. The graph also shows the 2020 spike of unemployment due to the pandemic. Which when you examine the data, and previous linear graphs, we had not had a spike that high since the 1980’s.

Top States With The Percentage Change in Unemployment Rate by Year

The graph below shows the percentage change in unemployment rate by year.

Code
increase_unemployment_year <- clean_unemployment %>%
  filter(percentage_change > 0) %>%
  select(year, state_area, percentage_change) %>%
  arrange(desc(percentage_change))
 

plot <- ggplot(increase_unemployment_year, aes(x = year, y = percentage_change, group = state_area, color = state_area)) +
  geom_line(linewidth = 1) +
  labs(
    x = element_text("Year", size = 10, face = "bold"),
    y = element_text("Percentage Change", size = 8, face = "bold"),
    title = "Percentage Change in Unemployment Rate by Year",
    caption = "Data from the Bureau of Labor Statistics"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",  # Remove legend
    plot.title = element_text(color = "#660066", size = 16, face = "bold"),   # Adjust the size as needed
    axis.text.y = element_text(hjust = 1, margin = margin(b = 40),
                               size = 8, face = "bold"),
    axis.text.x = element_text(color = "#660066", angle = 45, hjust = 1, margin = margin(b = 40), size = 12, face = "bold"),
    axis.title = element_text(color = "#660066", size = 12, face = "bold")
  ) +
  scale_color_viridis_d() +  # Use a color palette from the viridis package
  facet_wrap(~state_area, scales = "free_y") +  # Facet by state_area with independent y-axes
  scale_y_continuous(trans = "log10")  # Use a log scale for y-axis
plot

As I mentioned with the last graph, the 2020 spike in unemployment is the highest we have seen since the 1980’s. The graph above shows the top states with the highest percentage change in unemployment rate. The states with the highest percentage change in unemployment rate are: Nevada, Michigan, California, Rhode Island, and Illinois. These states also had the highest unemployment rate by year.

Brief Expected Conclusions and Models/Techniques Used

The expected conclusion of this project is to show that there is a relationship between specific region populations and unemployment trends. The models and techniques used to show this relationship are: linear regression, and data visualization. I used a time series analysis to determine if there is a relationship between two variables over time. I used data from the Bureau of Labor Statistics and the Bureau of Justice Statistics to determine the unemployment rate.

Preliminary Results with Model Selection

Model fit for chosen data sets, to get a ‘sense’ of the problem. Followed by what is believed to be the best model to refine and test original hypothesis, data analysis and exploration through plots.

Split and Test Various Data Sets

Code
set.seed(93422)

unemployment_split <- clean_unemployment %>% 
  initial_split(prop = 0.9)

unemployment_train <- unemployment_split %>% training()
unemployment_test <- unemployment_split %>% testing()

dim(clean_unemployment)
[1] 29891    14
Code
dim(unemployment_train)
[1] 26901    14
Code
dim(unemployment_test)
[1] 2990   14
Code
#|label: fit the models for training data

lr_mod <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")
Code
unemployment_lm1 <- lr_mod %>%
  fit(year ~ percentage_change, data = unemployment_train)

unemployment_lm2 <- lr_mod %>%
  fit(year ~ poly(percentage_change, 2), data = unemployment_train)

unemployment_test <- unemployment_test %>%
  mutate(
    pred_1 = predict(unemployment_lm1,
                     new_data= unemployment_test,
                     type = "raw"),
    pred_2 = predict(unemployment_lm2, 
                     new_data = unemployment_test,
                     type = "raw")
  )

unemployment_test %>%
  rmse(truth = percentage_change, estimate = pred_1)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1999.
Code
unemployment_test %>%
  rmse(truth = percentage_change, estimate = pred_2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1999.
Code
unemployment_lm1 <- lr_mod %>%
  fit(year ~ percent_of_labor_force_unemployed_in_state_area, data = unemployment_train)

unemployment_lm2 <- lr_mod %>%
  fit(year ~ poly(percent_of_labor_force_unemployed_in_state_area, 2), data = unemployment_train)

unemployment_test %>%
  rmse(truth = percent_of_labor_force_unemployed_in_state_area, estimate = pred_1)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1993.
Code
unemployment_test %>%
  rmse(truth = percent_of_labor_force_unemployed_in_state_area, estimate = pred_2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1993.

Cross Validation and KNN Model

Code
set.seed(1234)
cvs <- vfold_cv(clean_unemployment, v = 5, strata = region)

cvs_recipe <- recipe(region ~ ., data = clean_unemployment) %>%
  step_rm(state_area) %>%
  step_normalize(all_predictors())

cvs_recipe
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 13
── Operations 
• Variables removed: state_area
• Centering and scaling for: all_predictors()
Code
knn_mod <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_wflow <- workflow() %>%
  add_recipe(cvs_recipe) %>%
  add_model(knn_mod)

knn_fit <- knn_wflow %>%
  fit_resamples(cvs)
Code
knn_fit <- knn_wflow %>%
  fit_resamples(cvs, 
                metrics = metric_set(roc_auc, accuracy, precision))

knn_fit %>%
  collect_metrics()
# A tibble: 3 × 6
  .metric   .estimator  mean     n  std_err .config             
  <chr>     <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy  multiclass 0.982     5 0.000510 Preprocessor1_Model1
2 precision macro      0.985     5 0.000488 Preprocessor1_Model1
3 roc_auc   hand_till  0.998     5 0.000172 Preprocessor1_Model1

I tried various folds to make sure the model was not over-fitting. I also tried various neighbors to see which one would give me the best results. I found that the model was not over-fitting and that the best number of neighbors was 5. I also found that the model did not over-fit when I used the percentage change in unemployment rate and the percent of labor force unemployed in state area. Use tidyModels textbook to check the prediction with original value of particular columns.

Predictive Models

Code
#|label: Decision Tree Model
max_depth <- 5
set.seed(93422)

tree_mod <- decision_tree(tree_depth = max_depth) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%
  add_recipe(cvs_recipe) %>%
  add_model(tree_mod)
Code
#|label: tree fit

tree_fit <- tree_wflow %>%
  fit_resamples(
    cvs, metrics = metric_set(accuracy, roc_auc, precision)
    )

tree_fit_results <- tree_wflow %>%
  fit(clean_unemployment)
Code
#|label: Inspect the fit of the tree model

tree_fitted <- tree_fit_results %>%
  extract_fit_parsnip()

rpart.plot(tree_fitted$fit, roundint = FALSE)

Code
tune_grid <- grid_regular(
  cost_complexity(),
  tree_depth(range = c(1, 5)),
  min_n(range = c(5, 20)),
  levels = 2)

tune_grid
# A tibble: 8 × 3
  cost_complexity tree_depth min_n
            <dbl>      <int> <int>
1    0.0000000001          1     5
2    0.1                   1     5
3    0.0000000001          5     5
4    0.1                   5     5
5    0.0000000001          1    20
6    0.1                   1    20
7    0.0000000001          5    20
8    0.1                   5    20
Code
tree_mod <- decision_tree(cost_complexity = tune(),
                          tree_depth = tune(),
                          min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%
  add_recipe(cvs_recipe) %>%
  add_model(tree_mod)

tree_grid_search <- 
  tune_grid(
    tree_wflow,
    resamples = cvs,
    grid = tune_grid
  )

tuning_metrics <- tree_grid_search %>%
  collect_metrics()
Code
best_accuracy <- tuning_metrics %>%
  filter(.metric == "accuracy") %>%
  slice_max(mean)

best_roc_auc <- tuning_metrics %>%
  filter(.metric == "roc_auc") %>%
  slice_max(mean)

best_accuracy
# A tibble: 2 × 9
  cost_complexity tree_depth min_n .metric  .estimator  mean     n std_err
            <dbl>      <int> <int> <chr>    <chr>      <dbl> <int>   <dbl>
1    0.0000000001          5     5 accuracy multiclass 0.659     5 0.00241
2    0.0000000001          5    20 accuracy multiclass 0.659     5 0.00241
# ℹ 1 more variable: .config <chr>
Code
best_roc_auc
# A tibble: 2 × 9
  cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
            <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
1    0.0000000001          5     5 roc_auc hand_till  0.907     5 0.00104
2    0.0000000001          5    20 roc_auc hand_till  0.907     5 0.00104
# ℹ 1 more variable: .config <chr>

Final Conclusions

The primary goal of this project was to find trends in unemployment rates, by region and state, over time to help predict future trends in unemployment rates. Something that was beyond the scope of this project, but I still believe, a crucial point to be aware of. The relationship between unemployment rates and homelessness. According to research completed by a Professor of Economics at Columbia University who used data that was collected on both homelessness rates and unemployment. They found that for every increase of 1% in the unemployment rate, homelessness per 10,0000 people would increase by 0.65. At the time of the study in April of 2020 the model predicted an estimated 800,000 Americans would be homeless by summer. (Community Solutions and Dr. Brendan O’Flaherty, n.d.)

Why does any of this matter? Our analysis accurately predicted trends over time, for each region, state and counties in those states. Our models predicted with an accuracy of 99% for the decision tree model and 98% for the KNN model. These trends of the data over time, along with the high accuracy models allows us to predict future trends in unemployment rates. This is important because it allows us to also predict future trends in homelessness. Communities and struggling Americans do not have to suffer from homelessness caused by high unemployment. Policy makers and community leaders can use this information to help prevent these issues and put programs in place to help those who are struggling. As you saw from many different types of plots and graphs, that unemployment highs and lows were relatively consistent over time, the degree of change was not.

Code
clean_unemployment1 <- clean_unemployment %>%
  filter(year >= 2019 & year <= 2022) %>%
  select(year, state_area, percentage_change) %>%
  arrange(desc(year))
clean_unemployment1
Code
#|label: pulling out one state and looking at its percentage change from years 2019-2022

clean_unemployment2 <- clean_unemployment %>%
  filter(state_area == "California") %>%
  select(year, state_area, percentage_change) %>%
  arrange(desc(year))
Code
#average percentage change for California
mean(clean_unemployment2$percentage_change)
[1] 3.330851

The mean for California during the years 2019-2022 is 3.33. This means that the unemployment rate is expected to increase by 3.33% over the next 3 years. This is a significant increase and will have a large impact on homelessness rates.

Several suggestions for helping states lower their homelessness rates while there is a high unemployment included: a national moratorium on evictions, a national moratorium on foreclosures, and a national moratorium on utility shut-offs. (Community Solutions and Dr. Brendan O’Flaherty, n.d.) These suggestions would help prevent homelessness and help those who are struggling to pay their bills. By looking at non-institutionalized population, we can see that the number of people who are unemployed but actively looking for work has increased over time steadily for the past 10 years in the harder hit unemployment regions such as the West. This is important because it shows that the unemployment rate is not just increasing because people are not looking for work, but because there are not enough jobs available.

In sum, this dataset and predictive analysis can be used to help predict future trends in unemployment rates and by consequence spikes in homelessness rates as well. These two issues are not independent of each other, and when policymakers are addressing one, they also need to be considering the second issue, this problem is holistic and should be treated as such. Several of our predictive models showed, nationwide, that the entire country will be affected by the increase in unemployment rates in one region. This is not just a regional issue, but a national one.

References

Community Solutions, and Dr. Brendan O’Flaherty. n.d. “Analysis on Unemployment Prjocts 40-45% Increase In Homelessness This Year.” https://community.solutions/analysis-on-unemployment-projects-40-45-increase-in-homelessness-this-year/.
Jason Oh. n.d. “Unemployment in America Per US State.” https://www.kaggle.com/datasets/justin2028/unemployment-in-america-per-us-state.